HISTORICAL PERSPECTIVE ON THE APPLICATION OF THE
RASCH MODEL IN THE PORTLAND METROPOLITAN AREA

The work described in this monograph series grew from
20 years of cooperative projects in Oregon and Washington.

The first step in a cooperative, area-wide testing program
began in fall 1957, when Victor Doherty and George Ingebo
developed local norms for a high school testing program
in the Portland Public Schools. These local norms were reported
as standard scores (a mean of 50 and standard deviation of 10),
a significant advance over grade equivalent and percentile
scores. The following year, through the leadership of Bernice
Tucker, this local norming program was adopted by the Multnomah
County Intermediate Education District and made available to
districts throughout the county. In 1960, the three counties
in the Portland area established the Metropolitan Area Test
Program Board (MATPB) to make the new program available and
provide cooperative planning for new testing programs in
Clackamas, Multnomah, and Washington counties. In addition to
increasing efficiency and reducing costs of current programs,
the foundation of MATPB made it possible to establish
comprehensive, up-to-date Northwest norms considerably more
relevant than those provided by test publishers.

At this time, the MATPB program was extended to include
reading and mathematics in the elementary grades, three through
eight. At these levels, tests were developed, field tested,
revised, formatted and printed by MATPB committees, providing
MATPB control over the content of tests as well as the norming
base. An important innovation in these tests was the even
distribution of items at the low and high ends, which provided
valid measurement for the most and least able students, as
opposed to publishers' tests, which are heavily weighted toward
items of average difficulty. At this time, MATPB was so
successful that it involved districts comprising two-thirds of
Oregon's students.

By 1966, the desire for more flexible testing led to the
Computer Based Testing Project (COMBAT) with Teaching Research
as the supervising agency. As part of this project, teachers
wrote thousands of test questions which were entered into a
computer file. Each question was tagged by a behavioral
objective so that teachers could phone in one or more key words
and receive a ditto master of all relevant items in one or two
days. Although COMBAT was short lived, it provided valuable
experience for future item bank efforts. It demonstrated the
need for careful validation, through field testing, of the
items entered into the bank as well as the need for an organized
structure to describe the content covered by each item.

In 1970, Dr. Victor Doherty recognized the need to develop
a comprehensive system for detailing content in school subjects.
This led to the formation of the Tri-County Course Goals Project.
Since beginning work in 1971, this group has published
collections of course and program goals in twelve subjects
(language arts, mathematics, science, social studies, art, music,
physical education, health, industrial education, business
education, home economics and second language) covering grade
levels

In the late sixties, the desire to establish a more flexible
base for test development inspired interest in the Rasch
model. For example, provisions of the then-new Title I program
stipulated testing the least able students (most functioning
two or more years behind normal), and yet out-of-level testing
as promoted by several test publishers produced unsatisfactory
results. Motivated by an article written by Benjamin Wright,
Peter Wolmut and James Beaird attended a conference organized
by Dr. Wright to introduce the Rasch model. Based on their
reports and the results of initial data analysis, the Rasch
model appeared to provide the flexibility that was needed.

In 1973, a meeting of task forces in Washington and Oregon
led to the foundation of the Northwest Evaluation Association
(NWEA), whose mission was to develop goal-referenced item banks
in all school subjects. Simultaneously, Drs. Ingebo, Forster
and Forbes began carefully researching the important properties
of the Rasch model. With the help of the Office of the
Superintendent of Public Instruction in Washington and the
Northwest Regional Educational Laboratory in Portland, NWEA
sponsored Rasch conferences in February and September, 1974. At
the same time, the NWEA reviewed available sources of items and
began field testing to develop its initial item banks in reading
and mathematics. Not unexpectedly, the complications involved
in field testing, coupled with those of applying and expanding
the Rasch methodology, were sufficient to delay publishing the
initial versions of the mathematics and reading item banks
until Spring 1977. In the interim, the foundation for using
the banks was laid in several NWEA workshops covering the
techniques for applying the Rasch model, the characteristics of
the NWEA Rasch scales and the characteristics to be built into
the item banks.

The Northwest Evaluation Association has made Rasch
calibrated item banks available in reading, mathematics, and
language arts. Work on the development of item banks in
other subjects has already begun, and will be an important
activity of the Association for the next several years.



VOL. 1 MONOGRAPH II

CAN RASCH ITEM LEVELS BE DETERMINED WITHOUT RANDOM SAMPLING?

Background

Historically, the difficulty of a test question has been
tied to the performance of a specific group on that question.
For example, we might determine an item has a difficulty
p-value of 75% (correct) when taken by a particular group of
students. The problem with this approach is that we must
specify a "comparison" group (such as suburban fifth graders)
and are never quite sure how the item might work with a
different group. For this reason, those concerned with developing
good tests have attempted to use random samples to ensure the
p-value for a test item is representative of the comparison
population. Unfortunately, this course is fraught with
difficulty. If students are sampled randomly, they must be pulled
away from their regular classwork and school routine is likely
to be disrupted. If volunteers are used in the sample, the
question arises as to what might have happened if the holdouts
had been included. On the other hand, if teachers and principals
are forced to participate, there is no way to judge how carefully
they may have followed the proper standardized procedures or
prepared students for the tests. In short, random sampling
(or almost any sampling) has serious limitations.

For this reason, we wanted to explore the Rasch test
model, which supposedly eliminates the need for random sampling.
The Rasch approach makes this possible by using p-values and
student scores to determine the "true" difficulty level of each
item and the "true" achievement level of each student on an
underlying curriculum scale. Going beyond this, Ben Wright
(the father of the Rasch model in the U.S.) made what seemed
to be a preposterous claim: that an item would be calibrated
to the same level regardless of the students used in field
testing. Our doubts about this claim launched the research
described in this monograph.
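
The claim can be illustrated with a minimal sketch, which is not part of the original study. It assumes the standard one-parameter logistic form of the Rasch model, with abilities and difficulties in logits; the item difficulties and student abilities below are hypothetical. Under that model the gap in log-odds between two items is the same for every student, which is why calibrations should not depend on who is tested, even though the raw p-values do.

```python
import math


def rasch_prob(ability, difficulty):
    """Probability of a correct answer under the Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Convert a probability into log-odds."""
    return math.log(p / (1.0 - p))


def difficulty_gap(ability, easy_item, hard_item):
    """Log-odds advantage of the easy item over the hard item for one student."""
    return log_odds(rasch_prob(ability, easy_item)) - log_odds(rasch_prob(ability, hard_item))


# Hypothetical values chosen only for illustration.
EASY, HARD = -0.5, 1.0
ABILITIES = (-2.0, 0.0, 2.5)

# The p-value of the hard item changes a great deal from student to student...
p_values = [rasch_prob(a, HARD) for a in ABILITIES]
# ...but the gap between the two items is always HARD - EASY.
gaps = [difficulty_gap(a, EASY, HARD) for a in ABILITIES]

for a, p, g in zip(ABILITIES, p_values, gaps):
    print(f"ability={a:+.1f}  p(hard item)={p:.2f}  item gap={g:.2f}")
```

Real calibration works from observed responses rather than known probabilities, so the invariance holds only to the extent that the data fit the model. That is the question the study below examines.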

The Research Question: Can the Rasch Level on an
                       Item be Determined Without
                       Random Sampling?



The analysis was based on student responses to standardized
tests in fourth grade reading (approximately 1400 students).
We divided the student responses in two ways: (1) students
scoring above the average score vs. students scoring below
the average, and (2) students from inner-city schools vs.
students from the more advantaged schools. Then we calculated
the Rasch item difficulty levels separately for each of the
groupings, and for the total group of 1400 students. Finally,
we correlated the Rasch item levels as calibrated for the
different groups against the total group to determine the
degree of match.

Results

The results of our research are shown in Figure 1. In
addition to the correlation between item levels (which ensures
they are in the same order), it is also important to examine
the ratio of the standard deviations (which ensures they are
of the same magnitude). These two values can be combined
to form a "restricted" correlation; i.e., one in which the
pairs of item levels are required to have identical values
as well as to be in the same order. As can be seen from
the figure, the results indicate that the item levels agreed
quite well, even in the case of the extreme split of students
above and below the average.
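
The monograph does not give the formula for the restricted correlation. The sketch below shows one plausible way to combine the two values, offered as an assumption and not as the authors' method: the ordinary correlation is multiplied by 2·s1·s2 / (s1² + s2²), a factor that equals 1 when the two standard deviations match and shrinks as their ratio departs from 1. The item levels are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Sample standard deviation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Ordinary correlation: are the item levels in the same order?"""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (std(xs) * std(ys))


def sd_ratio(xs, ys):
    """Ratio of standard deviations: are the item levels of the same magnitude?"""
    return std(xs) / std(ys)


def restricted_r(xs, ys):
    """ASSUMED combination: r shrunk by a penalty for unequal spread."""
    sx, sy = std(xs), std(ys)
    return pearson_r(xs, ys) * (2.0 * sx * sy) / (sx ** 2 + sy ** 2)


# Hypothetical item levels: same order, but the subgroup scale is stretched 1.5x.
total_levels = [190.0, 195.0, 200.0, 205.0, 210.0]
subgroup_levels = [185.0, 192.5, 200.0, 207.5, 215.0]

print(round(pearson_r(subgroup_levels, total_levels), 3))     # ordering agrees fully
print(round(sd_ratio(subgroup_levels, total_levels), 3))      # spread does not
print(round(restricted_r(subgroup_levels, total_levels), 3))  # so the restricted value is lower
```

This example shows why a high ordinary correlation alone is not enough: the two sets of levels are perfectly ordered, yet they disagree in magnitude, and only the restricted value reflects that.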

Replication 

This research was repeated on approximately 4000 students
at fourth grade in reading and mathematics and 4000 students
at eighth grade in reading and mathematics. As shown in Figure 1,
the replication confirmed the results of the original research.

Conclusions 

Based on this research, it appears that random samples
are not needed to calibrate item levels in reading and
mathematics.
Figure 1

Comparison of Rasch Item Levels Calibrated on Student Subgroups

Correlations and Restricted Correlations* of Item Levels for Subgroups and the Total Group†

                                     FOURTH READING      FOURTH MATH         EIGHTH READING      EIGHTH MATH
GROUP                                 r    Restricted r   r    Restricted r   r    Restricted r   r    Restricted r

Above the Average vs. Total           .99   .83           .98   .?8           .98   .75           .97   .80
Below the Average vs. Total           .96   .81           .97   .89           .98   .81           .95   .85
Inner-City Schools vs. Total         1.00   .96           .99   .98          1.00   .96          1.00   .99
Top Schools vs. Total                1.00   .95           .99   .97          1.00   .97          1.00   .98
Inner-City Schools vs. Top Schools    .99   .92           .98   .95           .99   .92           .98   .96

*... in magnitude.

†Based on approximately 4000 students for each grade and subject.
IS THE RASCH ITEM LEVEL SCALE EQUAL INTERVAL?

Introduction



One of the claims made for the Rasch model is that its
items are measured on an equal interval scale. This is
important since making comparisons among traditional measures
of item difficulty is a shaky affair. The p-value (percent
correct), for example, varies significantly for different
student groups. A test item may have a p-value of 85% for
a high level fourth grade group and 45% for an inner-city
fifth grade group. Which of these is the "correct"
difficulty level?

After some consideration, the reason for this confusion
is clear. P-values are tied to a specific group rather
than the underlying curriculum. For this reason Georg Rasch
proposed his model to transform p-values into an equal interval
item level based on the underlying learning scale.



The Research Question: Is the Rasch Item Level
                       Scale Equal Interval?



Background 

When the Rasch item levels are calculated for a test,
they are centered on the average level for that test. Each
calibration of an item is the same except for the correction
needed to reflect the average level for the test. For example,
if one test has an average level ten points above that for
another, then an item calibrated at the 180 level on the first
test would be expected to be calibrated at the 190 level on the
second test.

By including several of the same items on both of the
two tests, the average difference of the calibrations
provides an unbiased estimate of the difference in the average
level of the two tests. The equal interval nature of the
Rasch scale can be checked by the consistency of the linking
values among several tests.
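
The linking step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' program: the item identifiers and calibrations are hypothetical, and the sign convention (second test minus first test) is a choice made here for the example.

```python
def linking_value(cal_first, cal_second, common_items):
    """Average difference (second minus first) in calibration over the common items."""
    diffs = [cal_second[item] - cal_first[item] for item in common_items]
    return sum(diffs) / len(diffs)


def to_second_scale(level_on_first, link):
    """Express a calibration from the first test on the second test's scale."""
    return level_on_first + link


# Hypothetical calibrations, each centered on its own test's average level.
first_test = {"m01": 180.0, "m02": 185.0, "m03": 192.0, "m04": 170.0}
second_test = {"m01": 190.5, "m02": 194.5, "m03": 202.0, "m07": 188.0}

# Items that appear on both tests form the linking block.
common = sorted(set(first_test) & set(second_test))
link = linking_value(first_test, second_test, common)

print(common)                                      # the shared items
print(link)                                        # estimated shift between the two tests
print(to_second_scale(first_test["m04"], link))    # an unshared item moved across
```

The individual differences in this example are 10.5, 9.5 and 10.0. Averaging over a block of items is what keeps any one item's calibration error from dominating the link.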

Method

To test this research question, a pool of 250 previously
field tested seventh grade mathematics items was formatted
into seven tests. The items were divided into two groups:
an easier group which had p-values over 20% and a difficult
group which had p-values of 20% or lower. The easier group
of items was formatted into four sixty-item tests of
graduated difficulty (forms W, X, Y, Z). These tests were
constructed so that twenty items were shared between
adjacent levels. The difficult group of items was formatted
into three tests (forms D, E, F). These tests were linked
to each other as well as to forms W and Z (see Figure 1).

The linking values between the tests are summarized in
Figure 2. Note that a different block of common items
was used between each pair of forms. The linking value
was calculated as the difference between the average
calibrations of a block of common items on the first test
and on the second test. Figure 3 shows the composite values
for the links from W to Z through two independent pathways
(W-X-Y-Z and W-D-E-F-Z). The two pathways led to a discrepancy
of 1.1 Rits or 4.4% of the total linking value.
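
The pathway check can be sketched as follows. The pairwise linking values here are hypothetical and are not the values from Figures 2 and 3, which are not reproduced in this text. Expressing the discrepancy as a percentage of the first pathway's total is also an assumption about how the percentage was computed.

```python
def composite_link(links, path):
    """Sum the pairwise linking values along a chain of test forms."""
    return sum(links[(a, b)] for a, b in zip(path, path[1:]))


# Hypothetical pairwise links (second form minus first form), in Rasch units.
links = {
    ("W", "X"): 7.0, ("X", "Y"): 6.5, ("Y", "Z"): 7.5,
    ("W", "D"): 25.0, ("D", "E"): 3.0, ("E", "F"): 4.0, ("F", "Z"): -10.4,
}

easy_route = composite_link(links, ["W", "X", "Y", "Z"])
hard_route = composite_link(links, ["W", "D", "E", "F", "Z"])

# If the scale is equal interval, both routes should give the same W-to-Z link.
discrepancy = abs(hard_route - easy_route)
percent = 100.0 * discrepancy / abs(easy_route)

print(f"W-X-Y-Z:   {easy_route:.1f}")
print(f"W-D-E-F-Z: {hard_route:.1f}")
print(f"discrepancy: {discrepancy:.1f} ({percent:.1f}% of the W-X-Y-Z total)")
```

Because each link is estimated from a different block of common items, agreement between the two routes is a test of the scale and does not follow automatically from the arithmetic.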


Conclusions 

These findings support the contention that Rasch item
levels are equal interval.
Figure 1

Linking Network Among Tests

Capital letters designate test forms.

Lower case letters designate linking items.

DO STUDENTS RECEIVE THE SAME RASCH
ACHIEVEMENT LEVEL FROM DIFFERENT TESTS?



Background 

Traditional methods for equating tests are complicated
and unreliable for two reasons. First, there are too many
variables to handle (differences in test length, test level,
and test range). Second, they are based on sample dependent
statistics which shift dramatically for different groups.

The Rasch model, on the other hand, circumvents these
problems by scaling each test on the same underlying
curriculum scale of student achievement. The purpose of
this research study was to verify this characteristic of
the Rasch model.



The Research Question: Do Students Receive the
                       Same Achievement Level
                       from Different Tests?



Method 

In fall 1977, two reading tests were given to students
in grades three through eight. The first was a thirty-item
field test, and the second was an 80 to 100-item standardized
survey test in reading. Since two different field tests
were used at all grade levels except three and five, there
were ten independent comparisons in the study.

Following field testing, the field tests and the survey
tests were linked to the Northwest Evaluation Association Rasch
Reading Scale. After eliminating low quality items from the
analysis, each student's achievement was scaled separately
for his performance on the field test and on the survey test.
The resulting pairs of raw scores and achievement levels for
each student were then averaged to make the required comparison.

Results 

As shown in Figure 1, the Rasch averages agree quite
closely for nearly all the test pairings. In those instances
where the difference was more than one Rasch unit, it was
found that the field test was poorly matched to the level
of the students in the sample and "ceiling" or "floor"
estimates were introduced inadvertently.
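
How a raw score on a calibrated test becomes an achievement level can be sketched as below. This is a minimal illustration, not the NWEA scoring program: it assumes the standard Rasch model in logits and a Newton-Raphson maximum likelihood search, and the two sets of item difficulties are hypothetical. It also shows where "ceiling" and "floor" estimates come from, since a perfect or zero score has no finite estimate.

```python
import math


def expected_score(ability, difficulties):
    """Expected number of correct answers at a given ability (logits)."""
    return sum(1.0 / (1.0 + math.exp(-(ability - b))) for b in difficulties)


def estimate_ability(raw_score, difficulties, tol=1e-8, max_iter=100):
    """Maximum likelihood ability for a raw score on items of known difficulty."""
    n = len(difficulties)
    if raw_score <= 0 or raw_score >= n:
        # Zero and perfect scores only bound the ability from one side.
        raise ValueError("zero and perfect scores have no finite estimate")
    # Starting guess: test average plus the log-odds of the raw score.
    theta = sum(difficulties) / n + math.log(raw_score / (n - raw_score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        residual = raw_score - sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        step = max(-1.0, min(1.0, residual / information))  # damp large steps
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical calibrated tests on one common scale (logits).
field_test = [-1.5 + 0.1 * i for i in range(30)]    # short, narrow test
survey_test = [-2.0 + 0.05 * i for i in range(80)]  # longer, wider test

# A student whose true level is 0.8 earns different raw scores on the two tests.
TRUE_LEVEL = 0.8
field_score = expected_score(TRUE_LEVEL, field_test)
survey_score = expected_score(TRUE_LEVEL, survey_test)

print(round(estimate_ability(field_score, field_test), 3))
print(round(estimate_ability(survey_score, survey_test), 3))
```

The raw scores in this example are expected (fractional) scores, so both tests recover the same level exactly. With real integer scores the two estimates differ by measurement error, which is why the study compares averages over students.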
Replication

This study was replicated on a more extensive basis
in fall 1977. Each student in grades four through eight took
two short achievement tests in reading and in mathematics. The
first test, the competency progress test (CP), was administered
on a grade level basis, while the second test, the achievement
level test (AL), was assigned to the student based on
previous measures of his achievement. The tests were scaled
independently and a pair of raw scores and achievement levels
were calculated in each subject for each student as in the
original study. As shown in Figure 2, the results of the
replication confirmed the findings from the previous study.
(It should be noted that this comparison is based on a single
form of the CP test, but up to ten different AL forms at
each grade level.) In those cases where the averages for
the two tests differ by more than a point, it was found that
the raw score distribution of the CP test was severely truncated
at either the top or bottom end, indicating the inappropriateness
of that test for much of the student sample.

Conclusions 

These results lead us to conclude that students do receive
comparable achievement levels from different tests, provided
that neither test is at a grossly inappropriate level.
Figure 1

Comparability of Rasch Achievement Levels

Comparison of Results Between the Competency Progress
and Achievement Level Test Programs

WHAT IS THE SMALLEST SAMPLE SIZE NEEDED FOR FIELD TESTING?



Background 

To field test the many items needed in building a compre- 
hensive item bank, it is important to take full advantage of 
the limited number of participating students and teachers by 
using the smallest sample which will yield reliable item level 
calibrations. This research study was intended to address
this practical issue.



The Research Question: What is the Smallest Sample
                       Size Needed for Field Testing?



Method 


In the fall of 1976, approximately 1400 students responded 
to a fourth grade mathematics test and the same number re-
sponded to an eighth grade reading test. A computer program
was developed to randomly select five samples each of sizes 
50, 100, 200, and 300. The calibrations for the five samples 
of a given size were then correlated with those for the total 
group of 1400 students. Figure 1 shows these correlations, 
together with the ratio of the standard deviations of the 
calibrations (which should be 1 if the metrics are equal), 
and the average discrepancy which is the average of the absolute 
value differences between the calibrations for the samples 
of a given size and the total group. 
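
The procedure above can be sketched with simulated data. Everything in this sketch is an assumption made for illustration: the students and items are simulated from the Rasch model, not taken from the 1976 responses, and the calibration is a crude stand-in (centered log-odds of a wrong answer) for a full Rasch calibration program. The three statistics are the ones named in the text: the correlation, the ratio of standard deviations, and the average absolute discrepancy.

```python
import math
import random
import statistics


def simulate_responses(abilities, difficulties, rng):
    """Simulate a students-by-items matrix of 0/1 answers under the Rasch model."""
    return [
        [1 if rng.random() < 1.0 / (1.0 + math.exp(-(a - b))) else 0 for b in difficulties]
        for a in abilities
    ]


def calibrate(responses):
    """Crude stand-in for a Rasch calibration: centered log-odds of a wrong answer."""
    n_students = len(responses)
    logits = []
    for j in range(len(responses[0])):
        correct = sum(row[j] for row in responses)
        p = (correct + 0.5) / (n_students + 1.0)  # keeps p strictly inside (0, 1)
        logits.append(math.log((1.0 - p) / p))
    center = statistics.mean(logits)
    return [x - center for x in logits]


def agreement(sample_cal, total_cal):
    """Return (correlation, SD ratio, average absolute discrepancy)."""
    sx, sy = statistics.stdev(sample_cal), statistics.stdev(total_cal)
    mx, my = statistics.mean(sample_cal), statistics.mean(total_cal)
    cov = sum((x - mx) * (y - my) for x, y in zip(sample_cal, total_cal))
    r = cov / ((len(total_cal) - 1) * sx * sy)
    discrepancy = statistics.mean(abs(x - y) for x, y in zip(sample_cal, total_cal))
    return r, sx / sy, discrepancy


rng = random.Random(1977)  # fixed seed so the sketch is repeatable
difficulties = [-2.0 + 0.2 * j for j in range(21)]      # hypothetical items (logits)
abilities = [rng.gauss(0.0, 1.0) for _ in range(1400)]  # hypothetical students
responses = simulate_responses(abilities, difficulties, rng)
total_cal = calibrate(responses)

# Five random samples at each size, each compared with the total group.
summary = {}
for size in (50, 100, 200, 300):
    runs = [agreement(calibrate(rng.sample(responses, size)), total_cal) for _ in range(5)]
    summary[size] = tuple(statistics.mean(run[k] for run in runs) for k in range(3))

for size, (r, ratio, disc) in summary.items():
    print(f"n={size:3d}  r={r:.3f}  SD ratio={ratio:.3f}  avg discrepancy={disc:.3f}")
```

The simulated numbers say nothing about the study's actual results. The sketch only shows how the three columns of Figure 1 can be produced for each sample size.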

Results 

Based on the data in Figure 1, we concluded that a sample 
size of 200 provided nearly as accurate information as 300, 
and yet was significantly more accurate than lower sample 
sizes. 

Replication 

Using the responses of approximately 3800 students at
fourth grade and seventh grade to a reading and a mathematics
test, five random samples were drawn of each of the sizes
50, 100, 150, 200, 250, and 300. The results of the
correlations between the calibrations for the five samples of
a given size and the total group are shown in Figure 2. Based
on these data we again concluded that a sample size of 200
appears to maximize the information available from a limited
field test population.
Conclusions

Based on the research, we use 200-300 students in field
testing new items for the NWEA item banks.
Figure 1
Fourth Grade Mathematics and Eighth Grade Reading 1974-1975

#The average absolute value difference in the calibrations.

$The ratio of the standard deviations of the two sets of calibrations.

Figure 2

Fourth and Seventh Grade Reading and Mathematics 1977

*Entries represent five samples drawn of each size.

#The average absolute value difference in the calibrations.
IS THE CALIBRATION OF AN ITEM AFFECTED
BY THE RANGE OF ITEMS ON A TEST?



Background

Every item is calibrated in the context of a test. This
led to the question of how the calibration for an item might
shift if the range of the test were restricted. The purpose
of this study was to answer that question.



The Research Question: Is the Calibration of an
                       Item Affected by the Range
                       of Items on a Test?



Method 

Approximately 1400 student responses to an eighty-item
fourth grade standardized survey test were used in this study.
First, the total test was calibrated. Then, as shown in Figure
1, the ten highest level and ten lowest level items were
dropped to yield a sixty-item subtest. Next, the five highest
and five lowest items were successively dropped to yield
subtests of 50, 40, 30, 20, and 10 items of decreasing range of
item levels. By this procedure, the items were calibrated
on several different subtests, the ten middle level items being
calibrated on all the subtests.

The calibrations for the items were then correlated
across subtests to identify any shifts or inconsistencies. The
correlations, which were essentially 1.00, were then further
restricted* to require the equality of metrics between the
subtests. The results of this analysis are shown in Figure 2.



Conclusions 

Based on these data, it appears that the calibration of
an item is not significantly affected by the range of item
levels on a test.
Figure 1

Comparison of Item Levels for Subtests of Varying
Ranges and Difficulty Drawn from a Fourth Grade
Mathematics Test

Figure 2
Correlations Among Subtests*

*Restricted to require equal metrics as well as equal orderings.



